On NP-Hardness of the Paired de Bruijn Sound Cycle Problem
The paired de Bruijn graph is an extension of the de Bruijn graph that
incorporates mate-pair information for genome assembly, proposed by Medvedev et al. However,
unlike in an ordinary de Bruijn graph, not every path or cycle in a paired de
Bruijn graph will spell a string, because there is an additional soundness
constraint on the path. In this paper we show that the problem of checking if
there is a sound cycle in a paired de Bruijn graph is NP-hard in the general case.
We also explore some of its special cases, as well as a modified version where
the cycle must also pass through every edge.
Comment: Peer-reviewed and presented as part of the 13th Workshop on Algorithms in Bioinformatics (WABI 2013).
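To make the setting concrete, here is a minimal sketch of an ordinary de Bruijn graph (not the paired variant studied in the paper): nodes are (k-1)-mers, edges are k-mers, and any walk spells a string. It is exactly this spelling property that the paired graph's soundness constraint restricts. The function names and the toy read are illustrative, not from the paper.

```python
# Minimal sketch of an ordinary de Bruijn graph: nodes are (k-1)-mers,
# edges are k-mers, and any walk spells a string -- the property that
# no longer holds unconditionally in a paired de Bruijn graph.
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes of its k-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def spell(path):
    """Spell the string for a walk given as a list of (k-1)-mer nodes."""
    return path[0] + "".join(node[-1] for node in path[1:])

graph = de_bruijn_graph(["ACGTAC"], k=3)
# "ACGTAC" yields the k-mers ACG, CGT, GTA, TAC
print(graph["AC"])                             # ['CG']
print(spell(["AC", "CG", "GT", "TA", "AC"]))   # ACGTAC
```

In the paired graph, each node additionally carries a second (k-1)-mer at a fixed genomic distance, and a walk is "sound" only if the two spelled strings are consistent with that distance; checking whether any sound cycle exists is what the paper proves NP-hard.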
An Efficient Algorithm For Chinese Postman Walk on Bi-directed de Bruijn Graphs
Sequence assembly from short reads is an important problem in biology. It is
known that solving the sequence assembly problem exactly on a bi-directed de
Bruijn graph or a string graph is intractable. However, finding a Shortest
Double stranded DNA string (SDDNA) containing all the k-long words in the reads
seems to be a good heuristic to get close to the original genome. This problem
is equivalent to finding a cyclic Chinese Postman (CP) walk on the underlying
un-weighted bi-directed de Bruijn graph built from the reads. The Chinese
Postman Problem (CPP) can be solved by reducing it to a general bi-directed
flow on this graph, which runs in O(|E|^2 log^2(|V|)) time. In this paper we show
that the cyclic CPP on bi-directed graphs can be solved without reducing it to
bi-directed flow. We present an O(p(|V| + |E|) log(|V|) + (d_max p)^3) time
algorithm to solve the cyclic CPP on a weighted bi-directed de Bruijn graph,
where p = max{|{v : d_in(v) - d_out(v) > 0}|, |{v : d_in(v) - d_out(v) < 0}|}
and d_max = max_v |d_in(v) - d_out(v)|. Our algorithm performs asymptotically better than the
bi-directed flow algorithm when the number of imbalanced nodes p is much
smaller than the number of nodes in the bi-directed graph. In our experiments
on various datasets, the value of p/|V| lies between 0.08% and 0.13% with 95%
probability.
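As a rough illustration of the quantities the running time depends on (not the paper's algorithm), the parameters p and d_max can be computed directly from the in- and out-degrees of the graph's nodes, following the definitions in the abstract. The function name and toy degree tables are assumptions for this sketch.

```python
# Sketch: computing the imbalance parameters p and d_max from node
# in/out-degrees, as defined in the abstract. p counts the larger of
# the surplus and deficit node sets; d_max is the largest imbalance.
def imbalance_params(din, dout):
    """din, dout: dicts mapping node -> in-degree / out-degree."""
    surplus = [v for v in din if din[v] - dout[v] > 0]
    deficit = [v for v in din if din[v] - dout[v] < 0]
    p = max(len(surplus), len(deficit))
    d_max = max(abs(din[v] - dout[v]) for v in din)
    return p, d_max

din = {"a": 2, "b": 1, "c": 0, "d": 1}
dout = {"a": 0, "b": 1, "c": 2, "d": 1}
print(imbalance_params(din, dout))  # (1, 2)
```

The experimental observation that p/|V| stays around 0.1% is what makes the p-parameterized bound attractive in practice: most nodes in read-derived de Bruijn graphs are balanced.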
Comparison of Spectra in Unsequenced Species
We introduce a new algorithm for the mass spectrometric identification of proteins. Experimental spectra obtained by tandem MS/MS are directly compared to theoretical spectra generated from proteins of evolutionarily closely related organisms. This work is motivated by the need for a method that allows the identification of proteins of unsequenced species against a database containing proteins of related organisms. The idea is that matching spectra of unknown peptides to very similar MS/MS spectra generated from this database of annotated proteins can lead to annotating unknown proteins. This process is similar to ortholog annotation in protein sequence databases. The difficulty with such an approach is that two similar peptides, even with just one modification (i.e. insertion, deletion or substitution of one or several amino acids) between them, usually generate very dissimilar spectra. In this paper, we present a new dynamic programming based algorithm, PacketSpectralAlignment. Our algorithm is tolerant to modifications and fully exploits two important properties that are usually not considered: the notion of inner symmetry, a relation linking pairs of spectrum peaks, and the notion of packets, which keeps related peaks together inside each spectrum. Our algorithm, PacketSpectralAlignment, is then compared to SpectralAlignment [1] on a dataset of simulated spectra. Our tests show that PacketSpectralAlignment behaves better, in terms of both results and execution time.
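To give a sense of the underlying recurrence, here is a toy version of the classic spectral alignment idea that PacketSpectralAlignment builds on (not the paper's algorithm, which adds symmetry and packet handling): find the largest set of matched peak pairs between two sorted mass lists, allowing at most k changes of mass offset, where each offset change models one modification. The function name, the cubic-time formulation, and the toy spectra are assumptions for this sketch.

```python
# Toy spectral alignment: maximum number of matched peak pairs between
# two sorted mass lists a and b, allowing at most k changes of offset
# along the alignment (each offset change models one modification).
def spectral_alignment(a, b, k):
    n, m = len(a), len(b)
    # D[i][j][s]: best chain ending with peak a[i] matched to b[j],
    # using at most s offset changes; a single match scores 1.
    D = [[[1] * (k + 1) for _ in range(m)] for _ in range(n)]
    best = 0
    for i in range(n):
        for j in range(m):
            for s in range(k + 1):
                for pi in range(i):
                    for pj in range(j):
                        same = (a[i] - b[j]) == (a[pi] - b[pj])
                        ps = s if same else s - 1  # spend a shift if offset changed
                        if ps >= 0:
                            D[i][j][s] = max(D[i][j][s], D[pi][pj][ps] + 1)
                best = max(best, D[i][j][s])
    return best

# b looks like a with the masses above 2 shifted by +1 (one "modification"):
print(spectral_alignment([1, 2, 3, 4], [1, 2, 4, 5], 0))  # 3
print(spectral_alignment([1, 2, 3, 4], [1, 2, 4, 5], 1))  # 4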
A Computational Method for the Rate Estimation of Evolutionary Transpositions
Genome rearrangements are evolutionary events that shuffle genomic
architectures. Most frequent genome rearrangements are reversals,
translocations, fusions, and fissions. While there are some more complex genome
rearrangements such as transpositions, they are rarely observed and believed to
constitute only a small fraction of genome rearrangements happening in the
course of evolution. The analysis of transpositions is further obfuscated by
intractability of the underlying computational problems.
We propose a computational method for estimating the rate of transpositions
in evolutionary scenarios between genomes. We applied our method to a set of
mammalian genomes and estimated the transposition rate in mammalian evolution
to be around 0.26.
Comment: Proceedings of the 3rd International Work-Conference on Bioinformatics and Biomedical Engineering (IWBBIO), 2015 (to appear).
Using cascading Bloom filters to improve the memory usage for de Bruijn graphs
De Bruijn graphs are widely used in bioinformatics for processing
next-generation sequencing data. Due to a very large size of NGS datasets, it
is essential to represent de Bruijn graphs compactly, and several approaches to
this problem have been proposed recently. In this work, we show how to reduce
the memory required by the algorithm of [3], which represents de Bruijn graphs
using Bloom filters. Our method requires 30% to 40% less memory than the
method of [3], with insignificant impact on construction time. At the same
time, our experiments showed a better query time compared to [3]. This is, to
our knowledge, the best practical representation for de Bruijn graphs.
Comment: 12 pages, submitted.
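For readers unfamiliar with the building block, here is a minimal Bloom filter sketch; the cascading construction in the paper layers several such filters so that each level stores the false positives of the level above. The class design and hashing scheme are illustrative assumptions, not the paper's implementation.

```python
# Minimal Bloom filter: a bit array plus several hash functions.
# Membership queries have no false negatives but may yield false
# positives -- the cascading construction exists to catch those.
import hashlib

class BloomFilter:
    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # Derive num_hashes independent positions from seeded SHA-256.
        for seed in range(self.num_hashes):
            h = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(size=1024, num_hashes=3)
for kmer in ["ACGT", "CGTA", "GTAC"]:
    bf.add(kmer)
print("ACGT" in bf)  # True: added items are always found
```

Storing only the bit array (plus a small structure for false positives, as in the cascade) is what makes such representations of de Bruijn graphs so memory-frugal compared to explicit edge lists.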
Cerulean: A hybrid assembly using high throughput short and long reads
Genome assembly using high-throughput short-read data arguably remains an
unresolved task in repetitive genomes: when the length of a repeat exceeds the
read length, it becomes difficult to unambiguously connect
the flanking regions. The emergence of third generation sequencing (Pacific
Biosciences) with long reads enables the opportunity to resolve complicated
repeats that could not be resolved by the short read data. However, these long
reads have a high error rate, and it is an uphill task to assemble the genome
without using additional high-quality short reads. Recently, Koren et al. (2012)
proposed an approach to use high quality short reads data to correct these long
reads and, thus, make the assembly from long reads possible. However, due to
the large size of both datasets (short and long reads), error-correction of
these long reads requires excessively high computational resources, even on
small bacterial genomes. In this work, instead of error correction of long
reads, we first assemble the short reads and later map these long reads on the
assembly graph to resolve repeats.
Contribution: We present a hybrid assembly approach that is both
computationally effective and produces high quality assemblies. Our algorithm
first operates with a simplified version of the assembly graph consisting only
of long contigs and gradually improves the assembly by adding smaller contigs
in each iteration. In contrast to the state-of-the-art long reads error
correction technique, which requires high computational resources and long
running time on a supercomputer even for bacterial genome datasets, our
software can produce comparable assembly using only a standard desktop in a
short running time.
Comment: Peer-reviewed and presented as part of the 13th Workshop on Algorithms in Bioinformatics (WABI 2013).
Group testing with Random Pools: Phase Transitions and Optimal Strategy
The problem of Group Testing is to identify defective items out of a set of
objects by means of pool queries of the form "Does the pool contain at least
one defective?". The aim is of course to perform detection with the fewest possible
queries, a problem which has relevant practical applications in different
fields including molecular biology and computer science. Here we study GT in
the probabilistic setting, focusing on the regime of a small defective
probability and a large number of objects. We construct and analyze one-stage
algorithms, for which we establish the occurrence of a non-detection/detection
phase transition resulting in a sharp threshold for the number of tests. By
optimizing the pool design we construct algorithms whose detection threshold
follows the optimal scaling. Then we consider two-stage algorithms and analyze
their performance for different choices of the first-stage pools. In
particular, via a proper random choice of the pools, we construct algorithms
which attain the optimal value (previously determined in Ref. [16]) for the
mean number of tests required for complete detection. We finally discuss the
optimal pool design in the case of finite defective probability.
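A toy simulation makes the one-stage setting concrete: each object joins each pool independently at random, and the simplest decoder declares clean any object that appears in at least one negative pool (the standard "COMP" rule, an assumption of this sketch rather than the paper's analysis; all names and parameter values are illustrative).

```python
# Toy one-stage group testing with random pools. Each item joins each
# pool with probability pool_prob; a pool tests positive iff it contains
# a defective. Decoding: any item seen in a negative pool is clean.
import random

def run_group_test(n_items, defect_prob, n_tests, pool_prob, rng):
    defective = {i for i in range(n_items) if rng.random() < defect_prob}
    pools = [{i for i in range(n_items) if rng.random() < pool_prob}
             for _ in range(n_tests)]
    results = [bool(pool & defective) for pool in pools]
    candidates = set(range(n_items))  # start with everything suspect
    for pool, positive in zip(pools, results):
        if not positive:
            candidates -= pool        # members of a negative pool are clean
    return defective, candidates

rng = random.Random(0)
defective, candidates = run_group_test(200, 0.02, 60, 0.2, rng)
print(defective <= candidates)  # True: this decoder never misses a defective
```

The phase transition studied in the paper shows up in such simulations as a sharp drop, as the number of tests grows, in the fraction of non-defective items left among the candidates.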
Limited Lifespan of Fragile Regions in Mammalian Evolution
An important question in genome evolution is whether there exist fragile
regions (rearrangement hotspots) where chromosomal rearrangements are happening
over and over again. Although nearly all recent studies supported the existence
of fragile regions in mammalian genomes, the most comprehensive phylogenomic
study of mammals (Ma et al. (2006) Genome Research 16, 1557-1565) raised some
doubts about their existence. We demonstrate that fragile regions are subject
to a "birth and death" process, implying that fragility has a limited
evolutionary lifespan. This finding implies that fragile regions migrate to
different locations in different mammals, explaining why there exist only a few
chromosomal breakpoints shared between different lineages. The birth-and-death
phenomenon of fragile regions reinforces the hypothesis that rearrangements are
promoted by matching segmental duplications and suggests putative locations of
the currently active fragile regions in the human genome.